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Abstract 

While stochastic variational inference is relatively well known for scal¬ 
ing inference in Bayesian probabilistic models, related methods also offer 
ways to circumnavigate the approximation of analytically intractable ex¬ 
pectations. The key challenge in either setting is controlling the variance 
of gradient estimates: recent work has shown that for continuous latent 
variables, particularly multivariate Gaussians, this can be achieved by us¬ 
ing the gradient of the log posterior. In this paper we apply the same idea 
to gamma distributed latent variables given gamma variational distribu¬ 
tions, enabling straightforward “black box” variational inference in models 
where sparsity and non-negativity are appropriate. We demonstrate the 
method on a recently proposed gamma process model for network data, 
as well as a novel sparse factor analysis. We outperform generic sampling 
algorithms and the approach of using Gaussian variational distributions 
on transformed variables. 


1 Introduction 


Bayesian probabilistic models offer a clean, interpretable methodology for ap¬ 
plied statistical analysis. However, inference remains a challenge both in terms 
of ease of implementation and scalability. Ease of implementation is important 
so that practitioners can construct models tailored specifically to their applica¬ 
tion, rather than being forced to choose from a small set of pre-existing models. 


While various software packages, such as Infer.NET (Minka et al. 2010), Win- 


BUGS (Lunn et al. 2000), Church (Goodman et al. 2012) and more recently, 


STAN (Stan Development Team [2014 ), have been designed explicitly to address 


this problem, they do not currently scale to large real world datasets. 


Stochastic variational inference (SVI) methods (Hoffman et al., 2010 2013) 


follow the traditional variational Bayes approach of converting an intractable 
integration into a optimization problem. However, where VB would usually 
proceed by the well known coordinate ascent updates on the variational lower 
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bound (Jordan et al. 19991, SVI utilizes the key idea of stochastic gradient 


descent: that it is enough to follow noisy, but unbiased, estimates of the gradient 
(Robbins & Monro 1951). If these noisy gradients can be computed much more 
cheaply than full gradients, e.g. by subsampling the data, then more rapid 
convergence is typically possible and the volume of data which can be handled 
is greatly increased. 

The observation that noisy, unbiased gradients can be used in variational 
inference suggests another idea: instead of analytically calculating the required 
expectations and gradient of the lower bound can we just use Monte Carlo? 
The challenge in applying this idea is to keep the variance of the Monte Carlo 
gradient estimates low without requiring a computationally infeasible number of 
samples. Various “tricks” have been proposed to achieve this, including control 


variates (Paisley et al. 2012), stochastic linear regression (Salimans & Knowles 


20131 and using the factor graph structure of the model (Ranganath et al. 


20131. We focus on a recently proposed solution for continuous latent variables 


proposed independently by Salimans & Knowles (20131 and Kingma & Welling 


(2013) which utilizes just the gradient of the log posterior, which we refer to as 
stochastic gradient variational Bayes (SGVB). While both papers demonstrated 
the effectiveness of this approach for multivariate Gaussian variables, whether 
it is equally useful for latent variables with very different distributions remains 
an open question. In this paper we investigate using gamma approximating 
distributions. Gompared to the Gaussian case, gamma r.v.s represent a natural 
step in the direction of more structured models: despite being continuous they 
can encode sparsity using suitably small shape parameters, while also enforcing 
non-negativity, which is appropriate in many settings. In addition gamma r.v.s 
also underly many of the most commonly used Bayesian nonparametric priors 
such as the Dirichlet process. 

The variational autoencoder Kingma & Welling (2013) uses a variational 
inference methodology where the approximate posterior is a function, known 
as the recognition model, of the observed data. This allows extremely scalable 
training using stochastic gradient descent analogously to a standard autoen¬ 
coder. While having certain advantages, this approach can be sensitive to the 
choice of recognition model, has only been demonstrated for Gaussian latent 
variables, does not straightforwardly handle missingness in the observations and 
only performs MLE over the model parameters. 


Related methodology has very recently been incorporated into Stan (Ku- 


cukelbir et al. 2015). Their approach is to always use a fully factorized Gaus¬ 


sian variational posterior but to reparameterize such that the space of the r.v.s 
is always the reals. For r.v.s constrained to be positive for example, this cor¬ 
responds to using a log-normal variational posterior. Our experiments here 
suggest that explicitly using gamma variational posteriors, at least when the 
priors are gamma, is preferable. 

In Section]^ we review stochastic gradient variational Bayes (SGVB), show 
how to leverage the gradient wrt to the log joint and present the necessary 
derivations for gamma r.v.s. We present two models in Sectio n [3| wh ich we use 
as test cases. The first is the infinite edge partition model (Zhou 2015) for 
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network data, the second a novel gamma process factor analysis model (GPFA) 
for arbitrary continuous data. In Section|^we present promising results for both 
models on synthetic and real world data and conclude in Section with some 
potential future directions. 


2 Methods 

In this section we review variational inference, show how the required gradients 
can be approximated using Monte Carlo and then turn to the particular case of 
gamma r.v.s. 

2.1 Variational inference 

Let the normalized distribution of interest be p(x) = f{x)/Z. Typically p is 
the posterior, / is the joint and Z is the marginal likelihood (evidence). We use 
Jensen’s inequality to lower bound 

log Z = log J /(x)dx = log J g(x)^|^dx> j g(x) log ^|^dx =: (1) 

where q represents the variational posterior. We can ask what error we are 
making between the true Z and the Evidence Lower BOund (ELBO) T[q\-. 

logZ - T[q] = J g(x) log =: KL{q\\p) = -H[q{x)] - J g(x) logp(x)dx, 

( 2 ) 

where KL{q\\p) is the KL divergence and H[q{x)] = — f q{x)logq{x)dx is the 
entropy. In general we can evaluate !F{q] but not the KL itself, since this would 
require knowing Z. By maximising the lower bound we will minimize the 
KL divergence. The KL divergence is strictly positive for q ^ p and equal to 
0 only for q = p. As a result finding the general q which minimizes the KL 
divergence is no easier than the original inference task, which we assume is 
intractable. The usual strategy therefore is to place simplifying constraints on 
q, the most popular, due to its simplicity, being the mean held approximation. 
We will take the approach of choosing q to have a specihc parametric form, 
indexed by 6. Typically qg will be in the exponential family: in this paper in 
particular, qg will be a product of gamma distributions. 

2.2 SGVB for continuous latent variables 

To ht qg we will maximize J^[qg] wrt to 9, which requires estimating the gradient 

= VeE,Jlog/(x) - logge(x)] (3) 

This form is not easily amenable to Monte Carlo estimation because of the 
dependence of qg on 9. One approach is to use the identity VeEqg[L(x)] = 
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Egg[L(x)V6ilogge(x)] where L{x) = log/(x) — log 96 i(x), but this typically has 
high variance. Instead, assume we can find a random variable z ^ 7r(z) such 
that X = '!/'(z, 0) has the same distribution as x ~ gg, then 

VeE,jL(x)] =E,(,)[Ve^(z,0)VxL(/(z,d))]. (4) 

Since 7r(z) has no dependence on 9 the RHS expression is straightforward to 
approximate by Monte Carlo, 

1 ^ 

VeE,jL(x)] « Ve^/>(z(®\6i)Vxi(x(®)), where z^®) ~7r,x(®) = 6»). 

S—1 

In fact, this estimator generally has low enough variance that we can simply use 
5 = 1 . 

For Gaussian random variables x an obvious choice is 7r(z) = N{0,I) and 
'(/'(z, {m, V}) = m + Vsz where {m,V} are the mean and (co)variance respec¬ 
tively. For gamma random variables no such simple transformation exists, so 
we resort to the generic CDF transform instead. For any random variable x 
with CDF Fg{x) we can sample x as 

z ~ t/[0,1],a; = F7^(z) =:-0(z,6») (5) 

where U[0, 1] is the uniform distribution on [0,1]. We can differentiate ip{z,9) 
with respect to 6 as 

=- , . (6 

fe[x) 

where fe{x) is the pdf of x. 

2.3 Gamma variational distributions 

For a gamma latent variable with shape a and rate b 

Fa,b[x) = r ( 7 ) 

do ^ \ 0 ,} 

It is straightforward to differentiate this expression wrt b and use Equation]^ 
to obtain Vbipiz, a,b). However the result is easier to obtain by noting that 
= Fa,i iz)/b and so 

\/bHz,a,b) = V,E-i(z) = ^bF-j{z)/b = -F-j{z)/b^ = -x/b (8) 

Unfortunately the gradient wrt to the shape a has no analytical form in terms 
of commonly available special functions. Depending on the order of magnitude 
of a different approaches can be used to accurately and efficiently approximate 
Va'>P{z, a, b). For moderate values of a we use a finite difference approximation 
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where e is a small positive constant. We use of the high numerical precision 
of the gaminv Matlab function (the Boost C++ library also implements such a 
function), which calculates F~^ using an iterative solver. 

For small values of a <i; 1 gemiinv often fails to converge. This regime is 
important because it corresponds to the gamma distribution’s ability to model 
sparsity. Fortunately, in this regime the asymptotic approximation Fa^iix) « 
becomes increasingly accurate, so that 

F-i(z)«(zaF(a))V6 (10) 

For a < 1 and (1 — 0.Odz) log(a) < —0.42 we use Equation [Io| to efficiently obtain 
both X and o, b) without expensive calls to gemiinv, whilst keeping the 

absolute relative error below 10“'^. Finally for large a S> 1 (we use a > 1000) 
the gamma distribution is well approximated by a Gaussian with matched mean 
and variance, i.e. , a, 6) « (a + sfaz'^jh and Vai^iz', a, 6) « (1 + z' j\/a)lb 
where z' ^ N{0,1 ). 


2.4 Optimization 

The shape and rate parameters for the gamma distribution are of course re¬ 
quired to be positive. To cope with this we use the reparameterisation r{6) = 
log(l + exp ( 0 )) for both the shape and rate to avoid performing constrained 
optimisation. Pseudocode is shown in Algorithm for the basic algorithm. We 
also experimented with incorporating momentum, and using AdaGrad (Duchi 
et al. 20111, RMSprop ( Tieleman fc Hinton[ 2012[ ) or AdaDelta (Zeiler 20121 
to set the learning rate. Momentum involves maintaining an additional velocity 
vector V which is updated as u •<— A 5 + (1 — X)v where g is the gradient and 
A € [0,1] is the momentum parameter, v is then used in the place of g when 
updating the parameters. Our implementation of AdaGrad uses a step-size 

qb) = 0.1/ ^10“® + 5?) where gt is the gradient at step t. RMSprop is 

similar in spirit to AdaGrad: we maintain a running average m <— O.lg^ + 0.9m, 
and use a step-size 7 ^*^ = 0.01/ (l x 10“® + . AdaDelta is a heuristic which 

tries to maintain progress in later stages of the optimization by keeping the same 
running average of squared gradients as RMSprop, nig •(— pg^ + (1 ~ (with 
p S [0,1]), as well as mg ^ p{^0Y + (1 — p)me (where A6 is the update in 
parameter space), and using a step-size 7 = \f{niQ + e)/ \f{nig + e), where we 
use e = 10-4. 


3 Models 

In this section we briefly outline the models we will use to assess the algorithm. 
We choose models which only involve gamma latent variables, but emphasize 
that models involving both Gaussian and gamma latent variables would also be 
straightforward to implement. 
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Algorithm 1 Gamma stochastic gradient variational Bayes 
Initialize t = 0, a, b. 
a = r~^ (a), f3 = r“^(b) 

repeat 

Sample Zd ^ U[0, 1] 

Set Xd = F~^^^{zd) according to Section 2.3 
Set g = Vx[log/(x) - log( 3 'a,b(x)] 

Set 5 “ = 9dSaF-l,^{zd)/{l + 

Set 5 ^ = 9dSbF-l^,^{zd)/{l + e-P'^) 

Compute step size 7 ^*^ (e.g. using AdaDelta on [g“,g^])) 

ad^ ad + 

13d+ 1^*^ 9^ 
a = r{a.), b = r{f3) 
t t “t" 1 
until convergence 


3.1 Infinite edge partition model 

There has been considerable recent interest in probabilistic modelling of net¬ 


work data, typically represented as an undirected graph (Kemp & Tenenbaum 


2006 Blundell & Teh 20131,. In a social network nodes will represent individ¬ 


uals and edges friendships, or in a protein interaction network nodes represent 
proteins and edges physical interaction. Let Y S {0,1}^^^ represent the bi¬ 
nary adjacency matrix of the graph: y^j = yji indicates whether there is a link 
between nodes i and j. Many models have been proposed to uncover the latent 
structure in such data, but we will focus on the recent infinite edge partition 


model (EPM, Zhou, 20151, which specifies 


P{y^j = l\W) = 1-exp -y^TkW^kWjk 


( 11 ) 


where IT is a A x A matrix of positive reals and r is a A-vector of re¬ 
als. This link function can be interpreted as summing over latent variables 
Sijk Poisson{rkWikWjk) and taking = I[0 < ^ijk]- This link func¬ 

tion has two advantages over logistic link functions: i) it is appropriate for 
sparse graphs since P{yij = 3\W) is small when IT « 0 , ii) the corresponding 
likelihood can be evaluated with computational cost linear in the number of 
observed present and missing edges (as noted by Morup et al. (20111), which 
is typically orders of magnitude smaller than AT To see this note that the 
likelihood is 


^ log (1 - + (1 - 

i>j 

= - e-P'^) +P^J]+Y^{1 - ( 12 ) 

i>j i>j i>j 
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where = 1 iff edge ij is not missing and 0 o.w., and pij = 'u^ikWjk- The 
first sum only involves non-missing existing edges, the second sum only missing 
edges, and the third sum can be calculated efficiently in 0{NK) as 

— 2 (13) 

i>j i>j k k i 

where w.k = J^i'^^k- 

To complete the prior specification we use 

W,k\a^,c,^G{a,,c,), a, - G(0.01,0.01), c,-G(l,l), 

?'fcl7o,co ~ G(7o//f,co), 7 o~G(1,1), co--G(1,1) (14) 

where the distribution on is a hnite K approximation to the gamma process. 


3.2 Gamma process factor analysis 


Factor analysis models are appealing for finding latent structure in high dimen¬ 
sional data. Observed data samples y„ S ,n = 1... N are modeled as 


y„|x„ - iV(Wx„,cr^I) 


(15) 


where typically x„ ~ 7V(0,I). Many approaches exist to fitting such models. 
From a Bayesian viewpoint, if W is given a Gaussian prior then Gibbs sam¬ 
pling x\W,— and W\x,— is straightforward and conjugate, although even in 
this simple setting the strong posterior dependencies between x and W can be 
problematic for convergence and mixing. We consider placing a Gamma prior 
on the elements of W, thereby enforcing positivity. Such “semi”-nonnegative 


matrix factorization (NMF) is closely connected to fc-means clustering (Ding 
et al. 20101, encouraging interpretable solutions while still allowing arbitrary 


real valued data, unlike classical NMF which requires positive data. Since our 
SGVB algorithm does not require conjugacy, we can integrate out x„ to give 


7V(0, WW^ -b cr^I) 


(16) 


The log likelihood is then 

L = ^ log |K| - i ^ yf K-iy* = y log |K| - ^tr(VY^K-^) (17) 

i 

where K = WW^-bcr^I and Y = [yi, ...,yAr]. Differentiating w.r.t. W we have 
BT 

— = TVK-^W -b K-iyr'^K-^W (18) 

oW 

After precomputing the per iterations operations are 0{D^), with no 

dependence on N. Similarly to the EPM we use a hierarchical gamma process 
prior construction for W: 


Wdk\rk,l ^ G{jrk,-f), 7-G(1,1), 

?’fc|7o,co ~ G(7o/Ar,co), 7 o~G(l,l), co--G(l,l). (19) 
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4 Results 


We present results on both synthetic and real world data for the two mod¬ 
els described in Section with inference performed using our gamma SGVB 
algorithm. 

4.1 Infinite edge partition model 

We initially investigated what choices of step size adaptation and momentum 
were most compatible with the gamma SGVB algorithm (Figure [^, at least 
in the context of the EPM. All methods performed comparably apart from 
RMSprop which performed poorly in this setting. Adadelta with momentum, 
despite not achieving the fastest initial improvement, obtained the best ELBO 
after 1000 iterations, likely because Adadelta continues to make progress af¬ 
ter Adagrad and standard SGD have stopped, and because momentum helps 
smooth over the stochasticity of the gradient estimates. 




Figure 1: Performance of various learning rate adaption methods, including 
using momentum, for GammaSGVB. Left: Negative ELBO (lower is better) 
with ^ iterations. Right: Einal ELBO after 1000 iterations. 


Since Adadelta seemed the most promising of the “automatic” methods we 
sought to validate the claim that the optimisation performance is not particu¬ 
larly sensitive to the choice of p and the momentum A. Eigurej^shows the ELBO 
achieved using Adadelta after 1000 iterations for varying p and A. Eor 1 — A in 
a range from 0.3 to 0.03 and p across the full range tested (0.684 to 0.99) the 
performance is very similar. Consider 1 — A = 0.1 i.e. A = 0.9: this is roughly 
equivalent to using information from the last 1/0.1 = 10 samples to calculate 
the gradient, which seems intuitively reasonable given these are independent 
samples from q. In contrast 1 — A = 1 means that only the current gradient 
is used (i.e. no momentum) which we see degrades performance, implying the 
gradient estimates are then somewhat too noisy. 

In order to assess performance quantitatively we compare to the MCMC 
implementation from Zhou (2015), , and the infinite relational model, at link 














prediction on the NIPS N = 234 dataselQ We attempted to compare to Stan, 


using Automatic Differentiation Variational Inference (AVDI, Kucukelbir et al. 


20151, but the gradient evaluations were always nain at initialization. We use 10 


training-test splits taking 20% of pairs as test data, and report test set AUCs 
for varying truncation levels K (Figure]^ note that the IRM is not truncated 
so its performance is equal at every K). While the carefully engineered MCMC 
algorithm consistently performs best (particularly for larger truncation levels), 
gammaSGVB still improves over the IRM. We include two alternative “black 
box” methods: MAP inference using gradient descent, and “NormSGVB” which 
is the equivalent algorithm to gammaSGVB but using a fully factorized normal 
distribution and using the reparameterization r{9) — log(l-|-exp (9)) to maintain 
non-negativity of the parameters, analogously to the approach used for SGVB 
in Stan. Both perform poorly, especially for larger truncation levels. In terms of 
runtime MCMC takes on average 30% longer to run than gammaSGVB, but we 
emphasize that the MCMC implementation is tuned for this model, including 
for example model specific mex functions. By contrast, just 200 iterations of 
Hamiltonian Monte Carlo, which is “black box” in the same sense as our method 
of requiring only gradients, takes an order of magnitude longer than SGVB (310s 
vs 19s for K = 10) and still gives inferior performance (average test set AUC of 
0.71). 
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Figure 2: Evidence lower bound 
(/lOOO) for the edge partition 
model (EPM) on the NIPS dataset, 
achieved after 1000 iterations us¬ 
ing Adadelta with different values of 
momentum and p. 


Figure 3: Test set AUC for varying 
truncation level K for the EPM on 
the NIPS dataset, across 10 train¬ 
ing/test splits. 


4.2 Gamma process factor analysis 

We first test our implementation of GPFA using synthetic data. We fix the 
number of dimensions D = 50, latent factors AT = 10 and vary the sample 

'http://chechiklab.biu.ac.ll/-gal/data.html 
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size N from 10 to 10^ to assess how the method copes with increasing sample 
size. The true factor loading matrix is sampled elementwise from the mixture 
0 . 8^0 + 0 . 2 [/[ 0 , 1 ], i.e. each element is non-zero with probability 0 . 2 , and those 
elements are uniform on [0,1]. The noise variance is 0.1, and the true latent 
factors x„ ~ A^(0,I). We compare to an MCMC implementation of the Indian 


Buffet Process based Nonparametric Sparse Factor Analysis (NSFA, Knowles 
fc Ghahramani[ 2011) and the sparse PC A (SPCA, Zou et al., 2006| ) algorithm 


implemented in the SpaSM toolbox (Sjostrand et al., 2012 1 . We allow SPCA to 
“cheat” by choosing the regularization parameter which minimizes the recon¬ 
struction error. To assess the recovery of the factor loadings W we compute 


the Amari error (Amari et al. 1996) which is invariant to permutation and 
scaling of the factor loadings. For small sample sizes N < 200, we see that 
NSFA typically slightly outperforms GPFA (Figure]^, presumably because the 
spike and slab prior better matches the true data generating mechanism. How¬ 
ever, as N increases the performance of NSFA actually degrades for the same 
number of MCMC iterations (1000), because convergence and mixing becomes 
problematic. In contrast the ability to integrate out the latent factors X when 
using gammaSGVB means that the inference problem becomes easier rather 
than harder as the sample size increases. SPCA is consistently outperformed by 
GPFA, suggesting that the LI regularization is not sufficient to reconstruct the 
factor loadings successfully, a finding that agrees with those in |Mohamed et alT] 
(2012). The computational cost of GPFA is also much lower than for NSFA 
because of the easily vectorized operations, and as noted in Sect ion [3)2| GPFA’s 


runtime has no dependence on N. The runtime of SPGA is approximately linear 
in N, so while it is considerably faster than GPFA for small A^, by A^ = 10"^ 
SPCA is actually slower. 





Figure 5: Perplexity on 
CyTOF data with in¬ 
creasing training sample 
size. For N = 100, the 
perplexity for the empiri¬ 
cal covariance is — 10 ^^. 


Figure 4: Results on synthetic data for GPFA. Left: 
Amari error for reconstructing the factor loading ma¬ 
trix. Right: run time (1000 iterations/samples for 
GPFA/NSFA). 


We apply GPFA to CyTOF (Bendall et al. 2011) data. CyTOF is a novel 


high through-put technology capable of measuring up to 40 protein abundance 
levels in thousands of individual cells per second. Specific proteins are tagged 
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using heavy metals which are measured using time-of-flight mass spectrometry. 
The sample we analyze consists of human immune cells, so representing the 
heterogeneity between cells is relevant for understanding disease response. Our 
dataset has N = 5.3 x 10® cells and D = AO protein expression levels. We run 
GPFA for 3000 iterations using Adadelta(p = 0.9, e = 1 x 10“^), K = AO and a 
prior l/cr^ ~ G(.l, .1) on the noise variance. Runtime is around 10 seconds on a 
quad-core 2.5GHz i7 MacBook Pro. To assess performance we split the dataset 
into a training and test set. Having fit the model on the training data, we 
calculate the perplexity (average negative log likelihood over test data points) 
of the remaining (test) data under the learnt model by drawing S = 100 samples 
, ^ q, and obtaining the expected covariance matrix, 

1 ^ 

Cov{y) = - ^ + al^I. (20) 


We compare to two simple alternatives: using the maximum likelihood estimator 
(i.e. the empirical covariance), and Ledoit-Wolfe shrinkage (Ledoit & Wolf 


20031. We see that for fewer than around N = 2000 training points GPFA 
outperforms the empirical covariance or Ledoit-Wolfe. In real datasets it is 
often the case that N is not significantly larger than D, or even N < D (usually 
referred as the “large p, small n” regime), so GPFA’s strong performance for 
smaller sample sizes is valuable. Finally in Figure]^ we show the empirical and 
estimated covariances, and the expected posterior factor loading matrix. 



Figure 6: Estimated covariance structure on GyTOF data. GPFA models 
the structure well, whilst regularizing the off diagonal components in partic¬ 
ular. Left: empirical covariance. Middle: covariance estimated under GPFA. 
Right: top 10 latent factor loadings. These are easier to interpret than the 
usual PGA loadings because of the enforced non-negativity. 
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5 Discussion 


Variational inference has been considered a promising candidate for scaling 
Bayesian inference to real world datasets for some time. However, only with the 
advent of stochastic variational methods has this hope really started to become 
a reality. Alongside minibatch based SVI allowing improved scalability, by using 
Monte Carlo estimation SGVB can also allow a wider range of models to be eas¬ 
ily handled than standard VBEM. The only model specific derivation we require 
is the gradient of the log joint, which is no more than is required for LBFGS 


or HMG. Indeed, automatic differentiation tools such as Theano (Bastien et al. 


20121 could (and should!) be used to obtain these gradients. We have shown 


here that these ideas apply to sparse continuous latent variables represented us¬ 
ing a gamma variational posterior, as well as to Gaussian variables. In addition 
we have shown that the ability to easily handle non-conjugate likelihoods can 
have advantages in terms of inference: in particular that collapsing models can 
improve performance. While this is well known for Latent Dirichlet Allocation 
(|Blei et al.||2003|), leveraging this understanding has previously required careful 


model specific derivations (see e.g. Teh et al. ( 2006| )). An interesting potential 
line of future research would be to combine the ideas presented here with the 
variational autoencoder ( [Kingma fc Welling 20131 to allow scalable, nonlinear, 
sparse latent variable models, while additionally giving some ability to model 
posterior dependencies. 
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